-
Notifications
You must be signed in to change notification settings - Fork 16
Pd support #94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pd support #94
Conversation
…ecode fields Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Update config_test to use a function to create configuration same as defined in the config yaml file Signed-off-by: Maya Barnea <mayab@il.ibm.com>
change command line argument name to 'kv-cache-transfer-latency' Signed-off-by: Maya Barnea <mayab@il.ibm.com>
) | ||
|
||
DescribeTable("invalid configurations", | ||
func(args []string) { | ||
_, err := createSimConfig(args) | ||
Expect(err).To(HaveOccurred()) | ||
}, | ||
Entry(tests[7].name, tests[7].args), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You removed a test here instead of only increasing the indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added test 13
|
||
return c | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is confusing, and one function is enough. There is only one case for createDefaultBasicConfig, and we can just update the lora parameters in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basicConfig function was removed
// isDoRemoteDecode() returns true is do_remote_decode is true in the request, this means that this is prefill request | ||
doRemoteDecode() bool | ||
// isDoRemotePrefill() returns true is do_remote_prefill is true in the request, this means that this is decode request | ||
doRemotePrefill() bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names in the comments don't match the actual names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/llm-d-inference-sim/request.go
Outdated
RemoteBlockIds []string `json:"remote_block_ids"` | ||
RemoteEngineId string `json:"remote_engine_id"` | ||
RemoteHost string `json:"remote_host"` | ||
RemotePort int `json:"remote_port"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding comments for the fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
pkg/llm-d-inference-sim/response.go
Outdated
RemoteBlockIds []string `json:"remote_block_ids"` | ||
RemoteEngineId string `json:"remote_engine_id"` | ||
RemoteHost string `json:"remote_host"` | ||
RemotePort int `json:"remote_port"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
@@ -304,6 +306,8 @@ func (s *VllmSimulator) readRequest(ctx *fasthttp.RequestCtx, isChatCompletion b | |||
var req textCompletionRequest | |||
|
|||
err := json.Unmarshal(ctx.Request.Body(), &req) | |||
|
|||
fmt.Printf("Unmarshaled text request: %#v\n", req) | |||
return &req, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug printing?
pkg/llm-d-inference-sim/simulator.go
Outdated
@@ -477,16 +491,23 @@ func (s *VllmSimulator) reqProcessingWorker(ctx context.Context, id int) { | |||
isChatCompletion: reqCtx.isChatCompletion, | |||
model: displayModel, | |||
}, | |||
responseTokens, toolCalls, finishReason, usageDataToSend, | |||
responseTokens, toolCalls, finishReason, usageDataToSend, req.doRemotePrefill(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add doRemotePrefill to streamingContext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, added
pkg/llm-d-inference-sim/simulator.go
Outdated
@@ -638,6 +671,14 @@ func (s *VllmSimulator) sendResponse(isChatCompletion bool, ctx *fasthttp.Reques | |||
s.responseSentCallback(modelName) | |||
} | |||
|
|||
// returns time to first token based on whether |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on whether what? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
README.md
Outdated
@@ -29,9 +29,9 @@ The simulator supports two modes of operation: | |||
- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used. | |||
- `random` mode: the response is randomly chosen from a set of pre-defined sentences. | |||
|
|||
Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`. | |||
Timing of the response is defined by `time-to-first-token` and `inter-token-latency` parameters. In case P/D is enabled for a request, `kv-cache-transfer-latency` will be used instead of `time-to-first-token`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace: is defined by `time-to-first-token
with: is defined by the time-to-first-token
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
README.md
Outdated
|
||
For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream. | ||
For a request with `stream=true`: `time-to-first-token` or `kv-cache-transfer-latency` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream. | ||
|
||
For a requst with `stream=false`: the response is returned after delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wasn't kv-cache-transfer-latency mentioned here?
@@ -616,8 +649,8 @@ func (s *VllmSimulator) createCompletionResponse(isChatCompletion bool, respToke | |||
// finishReason - a pointer to string that represents finish reason, can be nil, stop, length, or tools | |||
// usageData - usage (tokens statistics) for this response | |||
func (s *VllmSimulator) sendResponse(isChatCompletion bool, ctx *fasthttp.RequestCtx, respTokens []string, toolCalls []toolCall, | |||
modelName string, finishReason string, usageData *usage) { | |||
resp := s.createCompletionResponse(isChatCompletion, respTokens, toolCalls, &finishReason, usageData, modelName) | |||
modelName string, finishReason string, usageData *usage, doRemoteDecode bool, doRemotePrefill bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names doRemoteDecode and doRemotePrefill are strange. Shouldn't they be something like doPrefillOnly and doDecodeOnly, respectively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do-remote-decode
and do-remote-prefill
are field names of vLLM's request/response
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
No description provided.